
Conversation


@sarakodeiri sarakodeiri commented Sep 16, 2025

Short Description

Clickup Ticket(s): (https://app.clickup.com/t/868fm2hg6)

Both choices of meta classifier can be trained and saved as pickle files. Predictions can also be made and evaluated using TPR@FPR. The PR also includes preprocessing methods for building the training features (e.g., Gower distance, the DOMIAS method).
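The "saved as pickle files" step can be sketched as below (a minimal sketch; `save_meta_classifier` and the output layout are illustrative, not the toolkit's actual API):

```python
import pickle
from pathlib import Path


def save_meta_classifier(model, out_dir: str, name: str) -> Path:
    """Persist a fitted meta classifier as a pickle file.

    Sketch only: the toolkit's real save path and naming scheme may differ.
    """
    out_path = Path(out_dir) / f"{name}.pkl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        pickle.dump(model, f)
    return out_path
```

Either meta classifier choice (e.g. an XGBoost or logistic-regression pipeline) could be passed in as `model` once fitted.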

Things to note:

  • The pipeline has been tested with the data provided by the original repository, which is very small. More thorough tests are needed once all parts of the pipeline are finalized.
  • The original repository has an "iteration" count but doesn't use it properly. The same behavior has been mirrored here.
  • This implementation also follows the original repository's use of DOMIAS, but it might be better to apply DOMIAS only to non-categorical data; that should likewise be tested once the pipeline is finalized.
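The TPR@FPR metric mentioned above can be sketched with NumPy alone (a minimal sketch; the toolkit's own `get_tpr_at_fpr` helper in `train_utils` may be implemented differently):

```python
import numpy as np


def tpr_at_fpr(labels: np.ndarray, scores: np.ndarray, fpr_target: float = 0.1) -> float:
    """Return the highest TPR achievable while keeping FPR <= fpr_target.

    `labels` are binary membership labels; `scores` are attack scores
    where higher means "more likely a member".
    """
    order = np.argsort(-scores)        # sort by descending score
    ordered = labels[order]
    tps = np.cumsum(ordered)           # true positives at each threshold
    fps = np.cumsum(1 - ordered)       # false positives at each threshold
    tpr = tps / max(ordered.sum(), 1)
    fpr = fps / max((1 - ordered).sum(), 1)
    valid = fpr <= fpr_target
    return float(tpr[valid].max()) if valid.any() else 0.0
```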

Tests Added

Tests have been added for appropriate model fitting, data preprocessing, and the prediction function.

Collaborator

@fatemetkl fatemetkl left a comment

Really nice implementation of the blending++ pipeline! I also double-checked your implementation against the original ensemble codebase, and it matches perfectly 🙂.

I’ve left some comments, and I’ll address the ones where you’ve tagged me.
One suggestion (feel free to ignore): to clarify the different dataframes in this pipeline, in addition to docstrings, we could also reference the descriptions and diagrams in the example README.

trans_json_file_path: ${base_example_dir}/data_configs/trans.json
population_sample_size: 40000

# Metadata for real data
Collaborator

Not sure if this is a good idea, but can we create a trans_metadata.json or real_metadata.json file to store this information? I'm suggesting this because we already have several data config files, like trans_domain.json and info.json, with similar metadata. We could keep config.yaml for attack-pipeline configuration only, set the path to the metadata JSON file here (similar to trans_domain_file_path), and load it in the BlendingPlusPlus init.
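One hypothetical shape for that split (the `real_metadata_file_path` key and `real_metadata.json` file name are illustrative, not from the repo):

```yaml
# config.yaml keeps only attack-pipeline settings and points at the metadata file.
trans_json_file_path: ${base_example_dir}/data_configs/trans.json
trans_domain_file_path: ${base_example_dir}/data_configs/trans_domain.json
real_metadata_file_path: ${base_example_dir}/data_configs/real_metadata.json  # hypothetical
```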

log(INFO, "Data processing pipeline finished.")
if config.pipeline.run_data_processing:
run_data_processing(config)
elif config.pipeline.run_metaclassifier_training:
Collaborator

We can probably just have another if here instead of elif since someone might want to run data processing as well as meta classifier training.
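The suggested change can be sketched with stub stage flags (names assumed from the snippet above; the real functions take the full config):

```python
from dataclasses import dataclass


@dataclass
class PipelineFlags:
    run_data_processing: bool
    run_metaclassifier_training: bool


def run_pipeline(flags: PipelineFlags) -> list[str]:
    """Independent `if` blocks: both stages can run in a single invocation."""
    executed = []
    if flags.run_data_processing:
        executed.append("data_processing")
    if flags.run_metaclassifier_training:  # `if`, not `elif`
        executed.append("metaclassifier_training")
    return executed
```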

Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

df_synth = load_dataframe(
Collaborator

Is this the synthetic data generated from the TabDDPM model trained on the df_real_train data saved at Path(processed_attack_data_path, "real_train.csv")? Since we still haven't added the code for that, a comment saying where this synth data comes from would be helpful.

Collaborator Author

For now, this synthetic data has come from the original repository because I just wanted to test that it would run. I did the same with the RMIA signals file, only using them as placeholders. I'll add the comment for future reference, though. Thanks!


# 3. Get RMIA signals (placeholder)
rmia_signals = pd.read_csv(
"examples/ensemble_attack/data/attack_data/og_rmia_train_meta_pred.csv"
Collaborator

I know this is temporary because we still don't have the RMIA computation pipeline, but can we get this path from config.yaml instead of fixing it here?

Collaborator Author

I'd rather keep it like this for now because I'm not passing config.yaml to blending.py and don't plan on doing it. It will be resolved soon though!

Collaborator

Got it! Makes sense 👍


study = optuna.create_study(
direction="maximize",
sampler=optuna.samplers.TPESampler(n_startup_trials=10, seed=np.random.randint(1000)),
Collaborator

I see this is the random seed used in the original code, but as David suggested, we can make it an option on the class and pass the user-specified random_seed here (it could be the one the user sets in config.yaml).

colsample_bytree=trial.suggest_float("colsample_bylevel", 0.5, 1),
reg_alpha=trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10]),
reg_lambda=trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100]),
tree_method="auto",
Collaborator

Out of curiosity, was there an issue with tree_method="auto" if not use_gpu else "gpu_hist", ?

Collaborator Author

Nope. I wanted it to run on my local without taking up GPU space. Should've just changed use_gpu to false in config.yaml :)

.gitignore Outdated
# Trained metaclassifiers
examples/ensemble_attack/trained_models/
examples/ensemble_attack/attack_results/
examples/ensemble_attack_example/
Collaborator

I think we don't have this last one anymore

"categorical": ["trans_type", "operation", "k_symbol", "bank"]
"variable_to_predict": "trans_type"

col_type:
Collaborator

I think I am missing where col_type and bounds are used. Maybe adding some comments here would be helpful.

Collaborator Author

In the original codebase, bounds are used when they're training the XGBoost model to preprocess categorical columns, while the meta classifier features don't include them at all. I added them just in case, but didn't use them.

As for "col_type", you're right and they haven't been used yet. Not sure if they ever will.

I'll remove both for now and add them as needed.

Collaborator

@lotif lotif left a comment

Great stuff, thanks for addressing the comments! Approving with just some minor ones now.

As others have commented in this PR as well, please wait for their approval or make sure to address all their comments.

"clavaddpm_black_box",
"clavaddpm_white_box",
]
]
Collaborator

nitpick: no need to add this space here, it messes up the indentation.

process_split_data(
all_population_data=population_data,
processed_attack_data_path=Path(config.data_paths.processed_attack_data_path),
# TODO: column_to_stratify value is not documented in the original codebase.
Collaborator

I believe this is not true anymore? I see docstrings in that function right now.

Collaborator Author

I think by "original codebase", @fatemetkl meant the submission repository (link), but I'm not sure what the "TODO" is for. My guess is we test things with stratified columns specified.

Collaborator

Hi,
The "original codebase" in this comment refers to the attack submission. I should have added a link. So sorry for the confusion! I will fix it in my PR.

As Sara mentioned, since this parameter wasn’t documented in the original attack codebase, I added a TODO to experiment with other columns in case the one I specified isn’t the correct one.

Collaborator

@emersodb emersodb left a comment

Thank you for all the edits and work towards addressing the comments. Most of my previous comments have been addressed really nicely. I have a few additional small ones, and I left a few of my comments unresolved, as I think a few small changes remain before they are fully addressed. Let me know if you have questions on any comments!

@@ -1,70 +1,96 @@
# Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo
"""Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo."""
Collaborator

Rather than just referring to the "submission repo" I would add a link to said repo here if possible.

lr_model = LogisticRegression(max_iter=1000)
self.meta_classifier_ = lr_model.fit(meta_features, y_train)

else:
Collaborator

I think it's okay to keep this error if something weird happens (i.e. we expand the enum but forget to update this).

from midst_toolkit.attacks.ensemble.train_utils import get_tpr_at_fpr


optuna.logging.set_verbosity(optuna.logging.WARNING)
Collaborator

Maybe just add a comment here why you're setting this?

def __init__(
self,
input_features: pd.DataFrame,
label: np.ndarray,
Collaborator

super minor, maybe call this labels instead of label?

("preprocessing", preprocessing),
(
"xgboost",
xgb.XGBClassifier(
Collaborator

I think this comment remains relevant? We have a use_gpu argument, but it isn't making its way into the class?


def _evaluate_pipeline_cv(self, trial: Trial, num_kfolds: int) -> float:
"""
Performs cross-validation on the pipeline and returns the mean TPR at a fixed FPR.
Collaborator

Currently, this is doing it at a specific FPR threshold (0.1) and it isn't configurable. So I think the docs need to mention this here, and in the return value.
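One way to make the threshold explicit (a sketch; the fold-score plumbing and names are assumptions, not the toolkit's API):

```python
import numpy as np

DEFAULT_FPR_THRESHOLD = 0.1  # currently hard-coded in the pipeline


def mean_tpr_at_fpr(fold_labels, fold_scores, fpr_threshold=DEFAULT_FPR_THRESHOLD):
    """Mean TPR across CV folds, each measured at `fpr_threshold`.

    Making the threshold a parameter lets the docstring (and the return
    value's description) state exactly which FPR the TPR corresponds to.
    """
    tprs = []
    for labels, scores in zip(fold_labels, fold_scores):
        order = np.argsort(-scores)          # descending attack score
        ordered = labels[order]
        tpr = np.cumsum(ordered) / max(ordered.sum(), 1)
        fpr = np.cumsum(1 - ordered) / max((1 - ordered).sum(), 1)
        ok = fpr <= fpr_threshold
        tprs.append(float(tpr[ok].max()) if ok.any() else 0.0)
    return float(np.mean(tprs))
```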

Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

# Synthetic data borrowed from the attack implementation repository.
Collaborator

I'd add a link to where this was borrowed from.

Collaborator

@fatemetkl fatemetkl left a comment

Thank you for addressing the comments. Most of my comments are addressed, except for a few minor ones about the attack example config, the DOMIAS documentation, and the extra line in .gitignore, I believe. Happy to discuss 🙂.

@sarakodeiri sarakodeiri requested a review from emersodb October 3, 2025 18:23
Collaborator

@fatemetkl fatemetkl left a comment

Looks great to me! 🚀

@sarakodeiri sarakodeiri merged commit d8fc981 into main Oct 6, 2025
5 of 6 checks passed
@sarakodeiri sarakodeiri deleted the sk/meta_classifier branch October 6, 2025 16:06
bzamanlooy pushed a commit that referenced this pull request Oct 10, 2025
c83ecea Trainer: Refactoring process_pipeline_data function (#54)
2be0f81 Bump astral-sh/setup-uv from 6.7.0 to 6.8.0 (#56)
004fc67 Trainer: general variable renamings on data_loaders.py and adding a couple more enums (#53)
4c2d75f Trainer: Refactoring the _pair_clustering_keep_id (#52)
d8fc981 Ensemble attack: Meta classifier pipeline  (#37)

git-subtree-dir: deps/midst-toolkit
git-subtree-split: c83ecea
